ABSTRACT

This work is concerned with the estimation of multidimensional regression and the asymptotic
behaviour of the test involved in selecting models. The main problem with such models
is that we need to know the covariance matrix of the noise to get an optimal estimator. We
show in this paper that if we choose to minimise the logarithm of the determinant of the
empirical error covariance matrix, then we get an asymptotically optimal estimator. Moreover,
under suitable assumptions, we show that this cost function leads to a very simple asymptotic
law for testing the number of parameters of an identifiable and regular regression model.
Numerical experiments confirm the theoretical results.

1. INTRODUCTION

Let us consider a sequence $(Y_t, Z_t)_{t \in \mathbb{N}}$ of i.i.d. (i.e. independent identically distributed)
random vectors. The law of $(Y_t, Z_t) \in \mathbb{R}^d \times \mathbb{R}^{d'}$ is the same as that of the generic variable $(Y, Z)$.
We assume that the model can be written

$$Y_t = F_{w^0}(Z_t) + \varepsilon_t, \tag{1}$$

where

• $F_{w^0}$ is a parametric function, the true parameters being denoted $w^0$.

• $(\varepsilon_t)$ is an i.i.d. centred noise with unknown positive definite covariance matrix $\Gamma_0$.



The observations will be denoted with the lower-case letters $(z_t, y_t)_{1 \le t \le n}$. This notation
allows us to consider a wide range of regression models. For example, $F_{w^0}$ can be a matrix with
$d$ rows and $d'$ columns; the parameters are then the components of the matrix and the model is
a classical linear model. Another example is given by constrained linear models, knowing that
the constraints can also be nonlinear. Finally, $F_{w^0}$ can also be a nonlinear parametric function
like a multilayer perceptron (MLP). An MLP with $H$ hidden units (see Rumelhart et al.
(1986)) is defined by a family of functions

$$F_w(z) = \sum_{h=1}^{H} b_h \, s(a_h^T z + c_h) + d,$$

where $T$ denotes the transposition, $z \in \mathbb{R}^{d'}$, $s(t) = \tanh(t)$ and
$w = (a_1, \dots, a_H, b_1, \dots, b_H, c_1, \dots, c_H, d) \in \mathbb{R}^{Hd'} \times \mathbb{R}^{2H+1}$. We will focus on the MLP
example, because it is a widely used tool for nonlinear regression (see White (1992)), but it
could be any other nonlinear and differentiable model.
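As an illustration of the MLP family defined above, the function can be sketched in a few lines of NumPy. This is only a minimal sketch: the function name `mlp`, the array layout and the numerical weights are hypothetical choices made for the example, not values taken from the paper.

```python
import numpy as np

def mlp(z, a, b, c, d):
    """One-hidden-layer MLP F_w(z) = sum_h b_h * tanh(a_h^T z + c_h) + d.

    z: (d_in,) input; a: (H, d_in); b: (H,); c: (H,); d: scalar offset.
    """
    hidden = np.tanh(a @ z + c)      # activations of the H hidden units
    return float(b @ hidden + d)

# Hypothetical weights for H = 2 hidden units and d' = 3 inputs.
a = np.array([[0.5, -1.0, 0.0], [1.0, 0.0, 2.0]])
b = np.array([1.5, -0.5])
c = np.array([0.0, 0.1])
d = 0.2

value_at_origin = mlp(np.zeros(3), a, b, c, d)
n_params = a.size + b.size + c.size + 1    # H * d' + 2H + 1 parameters
```

Since tanh takes values in (-1, 1), the output is bounded in absolute value by the sum of the |b_h| plus |d|, which is the boundedness property used later for time series.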

Note that, for an MLP function, there exists a finite number of transformations of the 
weights leaving these functions unchanged; these transformations form a finite group (see 
Sussmann (1992)). Therefore, we will consider equivalence classes of MLPs: two MLPs are 
in the same class if the first one is the image by such a transformation of the second one. 
The set of parameters considered is then the quotient space of parameters. In the sequel, we 
will assume that the model is identifiable: 

$$F_{w_1}(Z) \overset{a.s.}{=} F_{w_2}(Z) \Rightarrow w_1 = w_2. \tag{2}$$

For example, this can be done if we consider one-hidden-layer MLPs with the true number 
of hidden units and with parameters in the quotient space. Another example is the linear 
regression function, with or without constraint. The consequence of the identifiability of the 
model is that, in most cases, the Hessian matrix of the model will be positive definite. Also,
note that it is not hard to generalise all that is shown in this paper to stationary mixing
variables and therefore to time series. For example, let us assume that the regression
function verifies $\|F_{w^0}(z)\| < a\|z\| + b$, with $0 < a < 1$ and $b \in \mathbb{R}$. Let $(Y_t)_{t \in \mathbb{N}}$ be the
stationary solution of the equation:

$$Y_t = F_{w^0}(Y_{t-1}) + \varepsilon_t,$$

where the noise $(\varepsilon_t)_{t \in \mathbb{N}}$ has a positive density everywhere with respect to the Lebesgue
measure and has a moment of order strictly larger than 1. It is well known (see Duflo
(1997)) that $(Y_t)_{t \in \mathbb{N}}$ will be geometrically ergodic and will verify a strong law of large numbers.
In particular, as MLPs are bounded functions, if $F_{w^0}$ is an MLP function, all the proofs
given in this paper will be valid exactly in the same way as in Yao (2000).

1.1. EFFICIENT ESTIMATION 

The estimation of the model (1) is done by minimising a suitable cost function with 
respect to the parameters. A common choice for the cost function is the mean square error 
(MSE): 

$$V_n(w) := \frac{1}{n} \sum_{t=1}^{n} \|y_t - F_w(z_t)\|^2,$$

where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$. This function is widely used because, in the
linear case without constraint on the parameters, it is optimal (see Lütkepohl
(1993)). In fact, this cost function gives a satisfactory estimator when there is one and
only one estimator which minimises the trace of the covariance matrix of the noise (see
Magnus and Neudecker (1988)). However, in other cases (constrained linear model, non-linear
regression, etc.) it leads in general to a suboptimal estimator (see, for example, Lütkepohl
(1993) for the constrained linear model). A better solution is then to use an approximation
of the error covariance matrix to compute the generalised least square estimator:

$$\frac{1}{n} \sum_{t=1}^{n} (y_t - F_w(z_t))^T \, \Gamma^{-1} \, (y_t - F_w(z_t)),$$

where $\Gamma$ has to be a good approximation of the true covariance matrix of the noise $\Gamma_0$.
For example, if we use a sequence of matrices $\Gamma_n$ converging in probability to $\Gamma_0$, it is easy
to show (see Chapter 5 in Gallant (1987)) that the estimator obtained by minimising the
cost function

$$\frac{1}{n} \sum_{t=1}^{n} (y_t - F_w(z_t))^T \, \Gamma_n^{-1} \, (y_t - F_w(z_t))$$

has the same asymptotic properties as the estimator which minimises

$$\frac{1}{n} \sum_{t=1}^{n} (y_t - F_w(z_t))^T \, \Gamma_0^{-1} \, (y_t - F_w(z_t)).$$



There are many ways to construct a sequence $(\Gamma_k)_{k \in \mathbb{N}}$ yielding an approximation of $\Gamma_0$.
The simplest is to use the ordinary least square estimator $\hat{W}_1 := \arg\min_w \frac{1}{n} \sum_{t=1}^{n} \|y_t - F_w(z_t)\|^2$
in order to estimate the covariance matrix of the noise:

$$\Gamma_1 := \Gamma(\hat{W}_1) := \frac{1}{n} \sum_{t=1}^{n} (y_t - F_{\hat{W}_1}(z_t))(y_t - F_{\hat{W}_1}(z_t))^T.$$

Moreover, we can use this new covariance matrix to find a generalised least square estimator
$\hat{W}_2$:

$$\hat{W}_2 = \arg\min_w \frac{1}{n} \sum_{t=1}^{n} (y_t - F_w(z_t))^T \, (\Gamma_1)^{-1} \, (y_t - F_w(z_t))$$

and calculate again a new covariance matrix

$$\Gamma_2 := \Gamma(\hat{W}_2) = \frac{1}{n} \sum_{t=1}^{n} (y_t - F_{\hat{W}_2}(z_t))(y_t - F_{\hat{W}_2}(z_t))^T.$$

Finally, it can be shown (see Gallant (1987)) that this procedure gives a sequence of parameters

$$\hat{W}_1 \to \Gamma_1 \to \hat{W}_2 \to \Gamma_2 \to \cdots$$

minimising the logarithm of the determinant of the empirical covariance matrix:

$$U_n(w) := \log\det\left( \frac{1}{n} \sum_{t=1}^{n} (y_t - F_w(z_t))(y_t - F_w(z_t))^T \right). \tag{3}$$



Hence, the cost function $U_n(w)$ is the same as the generalised least square cost function
with the best approximation of the true covariance matrix calculable with the available
data. Nevertheless, it is important to note that the matrix $\Gamma_n$ is always a function of the model
parameters and it is better to write $\Gamma_n(w)$ instead of $\Gamma_n$. The asymptotic study of
the model must take into account the dependency of $\Gamma_n$ on these parameters, and the real
function to study is in fact:

$$\frac{1}{n} \sum_{t=1}^{n} (y_t - F_w(z_t))^T \, \Gamma_n^{-1}(w) \, (y_t - F_w(z_t)).$$



This difficulty has always been overlooked except when the covariance matrix is included in
the parameters of the model, a solution which leads one to consider a pseudo Gaussian likelihood
as in Gourieroux et al. (1984). However, in this case it is necessary to reinforce the
assumptions on the moments of the noise to obtain the asymptotic normality of the estimated
covariance matrix. Although the logarithm of the determinant of the empirical covariance
is known to be related to the concentrated Gaussian likelihood function, it is better to
study it directly because such artificially strong assumptions about the noise are then not needed.

For all these reasons, we propose to study the asymptotic properties of the cost function
$U_n(w)$ and of the estimator minimising it: $\hat{W}_n := \arg\min_w U_n(w)$ will be shown
to have the same asymptotic behaviour as the generalised least square estimator using the
true covariance matrix of the noise.
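The alternating procedure of this subsection can be illustrated on a toy constrained linear model in which each generalised least square step has a closed form. The sketch below is only an assumption-laden example: the model F_w(z) = w z with a single shared scalar parameter, the sample size, the seed and all function names are hypothetical choices, not the setting of the paper.

```python
import numpy as np

def residuals(w, y, z):
    # Constrained linear model F_w(z) = w * z with one scalar parameter w.
    return y - w * z

def logdet_cost(w, y, z):
    # U_n(w): log-determinant of the empirical error covariance matrix.
    r = residuals(w, y, z)
    return np.linalg.slogdet(r.T @ r / len(r))[1]

def gls_step(gamma, y, z):
    # Closed-form minimiser of (1/n) sum_t (y_t - w z_t)^T Gamma^{-1} (y_t - w z_t).
    g_inv = np.linalg.inv(gamma)
    num = np.einsum("ti,ij,tj->", z, g_inv, y)
    den = np.einsum("ti,ij,tj->", z, g_inv, z)
    return num / den

rng = np.random.default_rng(0)                    # arbitrary seed
n, w_true = 500, 0.7
gamma_0 = np.array([[1.81, 1.8], [1.8, 1.81]])    # strongly correlated noise
z = rng.normal(size=(n, 2))
y = w_true * z + rng.multivariate_normal(np.zeros(2), gamma_0, size=n)

w = gls_step(np.eye(2), y, z)                     # W_1: ordinary least squares
costs = [logdet_cost(w, y, z)]
for _ in range(50):
    r = residuals(w, y, z)
    gamma = r.T @ r / n                           # Gamma_k = Gamma(W_k)
    w = gls_step(gamma, y, z)                     # W_{k+1}: GLS using Gamma_k
    costs.append(logdet_cost(w, y, z))
w_iterated = w
```

Each pass is one step of an alternating minimisation, so the recorded values of U_n never increase, and the fixed point of the iteration is a stationary point of U_n.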

1.2. TESTING THE NUMBER OF PARAMETERS 

The cost function $U_n(w)$ is not only optimal in the sense that its minimiser has the same asymptotic
behaviour as the generalised least square estimator using the true covariance matrix of the
noise, but it also leads to a very simple procedure for testing the nullity of the parameters.
Let $q$ be an integer smaller than $s$; we want to test "$H_0 : w \in \Theta_q \subset \mathbb{R}^q$" against "$H_1 : w \in \Theta_s \subset \mathbb{R}^s$",
where $\Theta_q$ and $\Theta_s$ are compact sets. $H_0$ expresses the fact that $w$ belongs to a
subset of $\Theta_s$ with a lower parametric dimension than $s$, so that $s - q$ parameters are
equal to zero. If we consider the classic mean square error cost function $V_n(w)$, we get the
following test statistic (see Yao (2000)):

$$S_n = n \times \left( \min_{w \in \Theta_q} V_n(w) - \min_{w \in \Theta_s} V_n(w) \right).$$



Under the null hypothesis $H_0$, it is shown in Yao (2000) that $S_n$ converges in distribution to
a weighted sum of $\chi^2_1$ variables

$$\sum_{i=1}^{s-q} \lambda_i \chi^2_{i,1},$$

where the $\chi^2_{i,1}$ are $s - q$ i.i.d. $\chi^2_1$ variables and the $\lambda_i$ are strictly positive values, different from
1 if the true covariance matrix of the noise is not the identity matrix.
However, if we use the function $U_n(w)$, under $H_0$, the test statistic

$$T_n = n \times \left( \min_{w \in \Theta_q} U_n(w) - \min_{w \in \Theta_s} U_n(w) \right)$$

will converge to a classical $\chi^2_{s-q}$ and the asymptotic level of the test will be very easy to
compute. This is another advantage of using the proposed cost function.
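Once the two minima of U_n are available, the test can be computed as in the following minimal sketch. The function name `logdet_ratio_test` and the two minima are hypothetical; the table holds the standard 5% critical values of the chi-square law for 1 to 5 degrees of freedom.

```python
# Upper 5% critical values of the chi-square law for 1 to 5 degrees of freedom.
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def logdet_ratio_test(u_restricted, u_full, n, df, critical_values=CHI2_95):
    """Return (T_n, reject_H0) with T_n = n * (min of U_n on Theta_q - min of U_n on Theta_s)."""
    t_n = n * (u_restricted - u_full)
    return t_n, t_n > critical_values[df]

# Hypothetical minima of U_n for two nested models fitted on n = 1000 observations,
# the restricted model having s - q = 1 parameter fewer.
t_n, reject = logdet_ratio_test(u_restricted=-3.300, u_full=-3.306, n=1000, df=1)
```

With the mean square error cost the critical value would instead depend on the unknown weights of the limiting sum, which is why the chi-square limit is convenient.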

Organisation of the paper. In order to prove these properties, the paper is organised
as follows. First, the main results are stated in three theorems: the first deals with the
consistency of the estimator minimising $U_n(w)$, the second with its asymptotic normality
and the third with the asymptotic law of the test procedure used to determine the number
of parameters. Then, the theoretical results are confirmed by numerical experiments. The
proofs of the theorems, which involve technical calculations of the first and second derivatives of
$U_n(w)$, are postponed to the appendix.

2. ASYMPTOTIC PROPERTIES 

First we give the conditions needed to state the consistency theorem, then those needed to state the asymptotic
normality theorem for the estimator $\hat{W}_n$ minimising the function $U_n(w)$. In the sequel all the
expectations will be calculated with respect to the true law of $(Y, Z)$.

Conditions for the consistency (C). 

1. The parameter space $W$ is a compact space included in $\mathbb{R}^K$, with $K$ the dimension of
the parameter vector $w$. The unique true parameter $w^0$ is assumed to be in the interior
of $W$. Note that $w^0$ is unique because the model is assumed to be identifiable. The
compactness of the parameter space means that the parameters have to be bounded by
a constant, even if this constant can be very large. This is the case in practice if one
uses a computer, since its numerical precision is finite. This rather classical hypothesis
is needed to get the Glivenko-Cantelli property, which yields a uniform law of large
numbers (see van der Vaart (1998)).

2. The noise of the model $\varepsilon$ is square integrable.

3. For almost all $z$ the function $w \mapsto F_w(z)$ is continuous; moreover there exists a square
integrable function $m$ such that

$$\sup_{w \in W} \|F_w(z)\| < m(z).$$

These conditions are easily verifiable for regular models. For example, in the case of an
MLP, it suffices to assume that the variable $Z$ has a finite third order moment. Indeed,
in this case there exists a constant $C$ such that we have the following inequalities (see Yao
(2000)):

$$\sup_{w \in W} \|F_w(Z)\| < C$$
$$\sup_{w \in W} \left\| \frac{\partial F_w(Z)}{\partial w_k} \right\| < C(1 + \|Z\|)$$
$$\sup_{w \in W} \left\| \frac{\partial^2 F_w(Z)}{\partial w_j \partial w_k} \right\| < C(1 + \|Z\|^2)$$
$$\sup_{w \in W} \left\| \frac{\partial^3 F_w(Z)}{\partial w_j \partial w_k \partial w_l} \right\| < C(1 + \|Z\|^3).$$

For a linear model it suffices to assume that the variable $Z$ has a finite second order moment.
Then, we deduce the theorem of consistency:

Theorem 1. Under the conditions (C), we have:

$$\hat{W}_n \xrightarrow{a.s.} w^0.$$

Now, we can establish the asymptotic normality for the estimator: 

Conditions for the asymptotic normality (AN). 

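1. There exists a square integrable function $m_1$ such that, for all $k \in \{1, \dots, K\}$:

$$\sup_{w \in W} \left\| \frac{\partial F_w(z)}{\partial w_k} \right\| < m_1(z).$$

2. There exist integrable functions $m_2$ and $m_3$ such that, for all $j, k, l \in \{1, \dots, K\}$:

$$\sup_{w \in W} \left\| \frac{\partial^2 F_w(z)}{\partial w_j \partial w_k} \right\| \le m_2(z) \quad \text{and} \quad \sup_{w \in W} \left\| \frac{\partial^3 F_w(z)}{\partial w_j \partial w_k \partial w_l} \right\| \le m_3(z).$$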

Thus, we deduce the theorem: 



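Theorem 2. Under the conditions (C) and (AN), when $n \to \infty$,

$$\sqrt{n}\,(\hat{W}_n - w^0) \xrightarrow{\mathcal{L}} \mathcal{N}(0, I_0^{-1}),$$

where, if we note

$$B(w_k, w_l) := \frac{\partial F_w(Z)}{\partial w_k} \frac{\partial F_w(Z)^T}{\partial w_l},$$

the component $(k, l)$ of the matrix $I_0$ is:

$$\operatorname{tr}\left( \Gamma_0^{-1} E\left( B(w^0_k, w^0_l) \right) \right).$$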

Remark. If $W^*_n$ is the generalised least squares estimator:

$$W^*_n := \arg\min_w \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_w(Z_t))^T \, \Gamma_0^{-1} \, (Y_t - F_w(Z_t)),$$

then it is easy to check that

$$\sqrt{n}\,(W^*_n - w^0) \xrightarrow{\mathcal{L}} \mathcal{N}(0, I_0^{-1}).$$

So, $\hat{W}_n$ has the same asymptotic behaviour as the generalised least square estimator
using the inverse of the true covariance matrix, $\Gamma_0^{-1}$, which is asymptotically optimal (see for example Ljung
(1999)). Therefore, the proposed estimator is asymptotically optimal too.

Asymptotic distribution of the test statistic $T_n$. Let us recall that we want to test
"$H_0 : w \in \Theta_q \subset \mathbb{R}^q$" against "$H_1 : w \in \Theta_s \subset \mathbb{R}^s$". $H_0$ expresses the fact that $w$ belongs to a
subset of $\Theta_s$ with a parametric dimension smaller than $s$, so that $s - q$ parameters are equal
to zero.

Let us write
$\hat{W}_n = \arg\min_{w \in \Theta_s} U_n(w)$ and $\hat{W}^0_n = \arg\min_{w \in \Theta_q} U_n(w)$, where $\Theta_q$ is viewed as a subset of
$\Theta_s$. Under the null hypothesis $H_0$, the asymptotic distribution of $T_n$ is a consequence of
Theorem 2. Indeed, if we replace $T_n$ by its Taylor expansion around $\hat{W}_n$ and $\hat{W}^0_n$, following
van der Vaart (1998), chapter 16, we have the classical development:

Theorem 3. Under the conditions (C) and (AN), if we assume that the matrix $I_0$ is not singular
and under the null hypothesis $H_0$, we have:

$$T_n = n \, (\hat{W}_n - \hat{W}^0_n)^T I_0 \, (\hat{W}_n - \hat{W}^0_n) + o_P(1) \xrightarrow{\mathcal{L}} \chi^2_{s-q},$$

where $o_P(1)$ means "negligible in probability" and is defined in van der Vaart (1998).

3. EXPERIMENTAL RESULTS 

3.1. SIMULATED EXAMPLE 

Although the estimator associated with the cost function $U_n(w)$ is theoretically better
than the ordinary least square estimator, we have to confirm this fact by simulation,
because there are some pitfalls in practical situations with MLPs.

The first point is that we have no guarantee of reaching the global minimum of the cost
function, because we use differential optimisation to estimate $\hat{W}_n$. Hence, we can only hope
to find a good local minimum by running many estimations with different initial weights.

The second point is that MLPs are black boxes, which means that it is difficult
to interpret their parameters and almost impossible to compare MLPs by comparing
their parameters, even if we try to take into account the possible permutations of the weights.

These reasons explain why we choose to compare the estimated covariance matrices of
the noise instead of directly comparing the estimated parameters of the MLPs.

The model. To simulate our data, we use an MLP with 2 inputs, 3 hidden units, and 2
outputs. We choose to simulate a time series, because it is a very easy task as the outputs
at time $t$ are the inputs for time $t + 1$. Moreover, with MLPs, the statistical properties of
such a model are the same as with independent identically distributed (i.i.d.) data. Indeed,
since the MLP function is bounded and the noise has a density positive everywhere with
respect to the Lebesgue measure, the simulated time series is an example of a process with
a geometrically ergodic solution (see Yao (2000)) and verifies a strong law of large numbers.
The equation of the model is the following:

$$Y_{t+1} = F_{w^0}(Y_t) + \varepsilon_{t+1},$$

where

• $Y_0 = (0, 0)$.

• $(Y_t)_{1 \le t \le 1000}$, $Y_t \in \mathbb{R}^2$, is the bi-dimensional simulated random process.

• $F_{w^0}$ is an MLP function with weights $w^0$ randomly chosen between $-2$ and $2$.

• $(\varepsilon_t)$ is an i.i.d. Gaussian centred noise with covariance matrix
$\Gamma_0 = \begin{pmatrix} 1.81 & 1.8 \\ 1.8 & 1.81 \end{pmatrix}$.

In order to empirically study the statistical properties of our estimator, we make 100 independent
simulations of the bi-dimensional time series of length 1000.
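One of these simulated series could be generated along the following lines. This is a sketch under stated assumptions: the paper does not give the exact weight layout, the presence of biases, the sampling law of the weights or the seed, so the uniform draw on [-2, 2], the bias terms, the seed and the names `mlp_2_3_2` and `simulate` are all hypothetical.

```python
import numpy as np

def mlp_2_3_2(y, a, c, b, d):
    # a: (3, 2) input weights, c: (3,) hidden biases,
    # b: (2, 3) output weights, d: (2,) output biases.
    return b @ np.tanh(a @ y + c) + d

def simulate(n, a, c, b, d, gamma_0, rng):
    # Y_{t+1} = F_w0(Y_t) + eps_{t+1}, started from Y_0 = (0, 0).
    noise = rng.multivariate_normal(np.zeros(2), gamma_0, size=n)
    series = np.zeros((n + 1, 2))
    for t in range(n):
        series[t + 1] = mlp_2_3_2(series[t], a, c, b, d) + noise[t]
    return series[1:], noise

rng = np.random.default_rng(0)                     # arbitrary seed (assumption)
a, c = rng.uniform(-2, 2, (3, 2)), rng.uniform(-2, 2, 3)
b, d = rng.uniform(-2, 2, (2, 3)), rng.uniform(-2, 2, 2)
gamma_0 = np.array([[1.81, 1.8], [1.8, 1.81]])
series, noise = simulate(1000, a, c, b, d, gamma_0, rng)
```

Repeating the call to `simulate` 100 times with fresh noise would give the independent replications described above.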

Results. For each time series we estimate the weights of the MLP using the cost function
$U_n(w)$ and the ordinary least square estimator. The estimations were made using the second-order
BFGS algorithm (see Press et al. (1992)), and for each estimation we kept the best
result obtained after 20 random initialisations of the weights, in the hope of avoiding
poor local minima.
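The restart strategy can be sketched independently of the optimiser. In the example below the local optimiser is a plain gradient descent on a finite-difference slope, used as a simple stand-in and explicitly not BFGS, and the one-dimensional toy cost is hypothetical; only the idea of keeping the best of 20 random initialisations comes from the text.

```python
import numpy as np

def best_of_restarts(cost, local_minimise, sample_init, n_restarts=20):
    """Run a local optimiser from several random starts and keep the lowest cost."""
    candidates = [local_minimise(cost, sample_init()) for _ in range(n_restarts)]
    return min(candidates, key=cost)

def gradient_descent(cost, w, step=0.01, n_iter=2000, eps=1e-6):
    # Stand-in local optimiser (NOT BFGS): descent on a central finite-difference slope.
    for _ in range(n_iter):
        w = w - step * (cost(w + eps) - cost(w - eps)) / (2 * eps)
    return w

def toy_cost(w):
    # Toy cost with a poor local minimum near w = 1.13 and a better one near w = -1.30.
    return w**4 - 3 * w**2 + w

rng = np.random.default_rng(0)                     # arbitrary seed
w_best = best_of_restarts(toy_cost, gradient_descent, lambda: rng.uniform(-2, 2))
```

A single run started in the wrong basin would stop at the poorer minimum; keeping the best of several runs makes this much less likely, without giving any guarantee.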

We show here the mean of the estimated covariance matrices of the noise for the $U_n(w)$ and
the mean square error (MSE) cost functions:



The estimated standard deviations of the terms of the matrices are all equal to 0.003, so
the differences observed between the terms of the two matrices are greater than twice their
standard deviation and are probably not due to chance. We can see that the estimated covariance
of the noise is on average better with the estimator associated with the cost function $U_n(w)$.
In particular, it seems that there is slightly less over-fitting with this estimator, and the
non-diagonal terms are greater than with the least square estimator. As expected, the
determinant of the mean matrix associated with $U_n(w)$ is 0.036 instead of 0.050 for the
matrix associated with the MSE.
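For reference, the determinant of the true covariance matrix used in the simulation can be checked directly. The short sketch below only restates arithmetic on the matrix given in the model description; the variable names are arbitrary.

```python
import numpy as np

gamma_0 = np.array([[1.81, 1.8], [1.8, 1.81]])
det_true = np.linalg.det(gamma_0)            # 1.81**2 - 1.8**2 = 0.0361
eigenvalues = np.linalg.eigvalsh(gamma_0)    # 1.81 - 1.8 = 0.01 and 1.81 + 1.8 = 3.61
```

The true determinant, 0.0361, is close to the value 0.036 reported for $U_n(w)$, and the very small eigenvalue reflects the strong correlation between the two noise components.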
3.2. APPLICATION TO REAL TIME SERIES: POLLUTION OF OZONE

Ozone is a reactive oxidant, which is formed both in the stratosphere and in the troposphere.
Near the surface of the ground, ozone is directly harmful to human health and plant life and
damages physical materials. The population, especially in large cities and in suburban zones
which suffer from summer smog, wants to be warned of high pollutant concentrations in
advance. Statistical ozone modelling, and more particularly regression models, have been
widely studied; see Comrie (1997) and Gardner and Dorling (1998). Generally, linear models
do not seem to capture all the complexity of this phenomenon. Thus, the use of nonlinear
techniques is recommended to deal with ozone prediction. Here we want to predict ozone
pollution at two sites at the same time. The sites are in the south of Paris (13th district)
and at the top of the Eiffel Tower. As these sites are very near each other, we can expect
the two components of the noise to be highly correlated.

The model. The neural model used in this study is autoregressive and includes exogenous
variables (a so-called NARX model). Our aim is to predict the maximum level of ozone pollution
of the next day, given today's maximum level of pollution and the maximal temperature
of the next day. If we note $Y^1$ the maximum level of pollution for Paris 13, $Y^2$ the maximum
level of pollution for the Eiffel Tower and $Temp$ the temperature, the model can be written
as follows:

$$(Y^1_{t+1}, Y^2_{t+1}) = F_w(Y^1_t, Y^2_t, Temp_{t+1}) + \varepsilon_{t+1}. \tag{4}$$

If we assume that the temperature $(Temp_t)_{t \in \mathbb{N}}$ is a geometrically ergodic process and that
the noise $(\varepsilon_t)_{t \in \mathbb{N}}$ has a strictly positive density with respect to the Lebesgue measure, then, as
$F_w$ is a bounded function, the stationary solution $(Y^1_t, Y^2_t)_{t \in \mathbb{N}}$ of equation (4) will be
geometrically ergodic and the previous theorems can easily be extended to this time series.

As usual with real time series, over-training is a crucial problem, since MLPs are heavily
over-parametrised models. Over-training occurs when the model learns the details of the noise of the
training data. Over-trained models have very poor performance on fresh data. In this study,
to avoid over-training, we use the Statistical Stepwise Method (SSM) pruning technique,
using a BIC-like information criterion (Cottrell et al. (1995)). The MLP with the minimal
dimension is found by eliminating the irrelevant weights in order to minimise a BIC-like
criterion, that is to say the cost function penalised by the term $q \frac{\log n}{n}$, where $q$ is the number
of parameters of the model and $n$ is the number of observations. Here, we will compare the
behaviour of this method for both cost functions: the mean square error (MSE) and the
logarithm of the determinant of the empirical covariance matrix of the noise ($U_n(w)$).
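The comparison made at each pruning step can be sketched as follows, assuming the penalty has the usual BIC form q log(n) / n. The function name `bic_like`, the cost values and the sample size are hypothetical and serve only to show how the criterion trades fit against the number of weights.

```python
import math

def bic_like(cost, q, n):
    """Cost function (e.g. the minimum of U_n) penalised by q * log(n) / n."""
    return cost + q * math.log(n) / n

# Hypothetical minimal costs of two candidate architectures on n = 500 observations.
full = bic_like(cost=-3.310, q=15, n=500)
pruned = bic_like(cost=-3.300, q=13, n=500)
keep_pruned = pruned < full     # True when the loss of fit is smaller than the saved penalty
```

Here removing two weights increases the cost by 0.010 but lowers the penalty by about 0.025, so the smaller network would be retained.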

Dataset. This study uses the ozone concentrations measured by the Air Quality Network of the
Île-de-France Region (AIRPARIF, Paris, France). The data used in this work span from
1994 to 1997. According to the model, we have the following explanatory variables:

• The maximum temperature of the day.

• The previous day's ozone peak, which accounts for persistence.

Before being used in the neural network, all these data have been centred and normalised.
The learning dataset consists of observations from 1994 to 1996. Only the months from April
to September are used because there is no peak of ozone pollution during the winter period.
The months from April to September of 1997 are kept as a test set, which is used for
evaluating the models.
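The centring and normalisation step could look like the sketch below. It assumes, which the text does not state, that the statistics are computed on the learning set and then reused for the test set; the synthetic data, their sizes and the function name `standardise` are hypothetical stand-ins for the AIRPARIF measurements.

```python
import numpy as np

def standardise(train, test):
    """Centre and scale each column using statistics of the learning set only."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

# Synthetic stand-in data: two ozone peaks and a temperature per day (hypothetical values).
rng = np.random.default_rng(0)
train = rng.normal(loc=[80.0, 90.0, 22.0], scale=[30.0, 35.0, 5.0], size=(540, 3))
test = rng.normal(loc=[80.0, 90.0, 22.0], scale=[30.0, 35.0, 5.0], size=(180, 3))
train_s, test_s = standardise(train, test)
```

Reusing the learning-set statistics keeps the test set untouched by the fitting procedure, so its columns are only approximately centred.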

The results. For the learning set, we get the following results:

For the test set, we get the following results:

The two matrices are almost the same for the learning set; however, the non-diagonal terms
are greater for the $U_n(w)$ cost function. The best MLP for $U_n(w)$ has 13 weights, and the
best MLP for the MSE cost function has 15 weights. Hence, the proposed cost function leads
to a more parsimonious model, most likely because the pruning technique is very sensitive to
the variance of the estimated parameters. This gain is valuable regarding the generalisation
capacity of the model, since the difference between the two cost functions is almost null for the learning data
set but is greater for the test data. For comparison, we also trained a one-output
MLP to predict each level of pollution separately, and the results match the diagonal terms obtained with the MSE
cost function.

4. CONCLUSION 

In the linear multidimensional regression model without constraint, the optimal estimator
has an analytic solution and it minimises both the ordinary mean square function and $U_n(w)$;
therefore it is not useful, in this case, to consider $U_n(w)$. However, for the constrained linear
model and for the non-linear multidimensional regression model, the ordinary least square
estimator is sub-optimal if the covariance matrix of the noise is not the identity matrix. We
can overcome this difficulty by using the cost function $U_n(w) = \log\det(\Gamma_n(w))$. Indeed, this
cost function is the same as the generalised least square cost function with the best approximation
of the true covariance matrix calculable with the available data. In this paper, the
proofs of the consistency, of the asymptotic normality and of the optimality of this estimator
have been provided. Moreover, we have proved that, if the model is identifiable, this cost
function leads to a simpler test to determine the number of weights. These theoretical results
have been confirmed by a simulated example, and we have seen on a real time series that
we can expect a slight improvement, especially in model selection, because pruning techniques
are very sensitive to the variance of the estimated weights. Nevertheless, we have to note
that the main difficulty in regression with MLPs is the lack of theoretical justification for
procedures determining the number of hidden units. Indeed, determining the true number of
hidden units is very important in order to have an identifiable model. For practical situations
and without theoretical justification, a BIC-like penalised cost function seems to work well.

5. APPENDIX

Proof of theorem 1. First we have to show that the limit, as $n$ goes to infinity, of $U_n(w)$
is minimised only by $w^0$.

Lemma 1. Under the conditions (C):

$$\lim_{n \to \infty} U_n(w) - U_n(w^0) \overset{a.s.}{\ge} 0$$

and

$$\lim_{n \to \infty} U_n(w) - U_n(w^0) \overset{a.s.}{=} 0 \Rightarrow w = w^0.$$

Proof: Let us note

$$\Gamma(w) = E\left( (Y - F_w(Z))(Y - F_w(Z))^T \right) \tag{5}$$

the expectation of the covariance matrix of the noise for the model parameter $w$, and remark
that $\Gamma_0 = \Gamma(w^0)$. By the strong law of large numbers we have

$$U_n(w) - U_n(w^0) \xrightarrow{a.s.} \log\det(\Gamma(w)) - \log\det(\Gamma_0) = \log \frac{\det(\Gamma(w))}{\det(\Gamma_0)}$$
$$= \log\det\left( \Gamma^{-1}(w^0) \left( \Gamma(w) - \Gamma_0 \right) + I_d \right),$$

where $I_d$ denotes the identity matrix of $\mathbb{R}^d$. So, the lemma is true if $\Gamma(w) - \Gamma_0$ is a positive
matrix, null only if $w = w^0$, because the determinant of $\Gamma^{-1}(w^0)(\Gamma(w) - \Gamma_0) + I_d$ will then be
bigger than 1. This is the case since

$$\Gamma(w) = E\left( (Y - F_w(Z))(Y - F_w(Z))^T \right)$$
$$= E\left( (Y - F_{w^0}(Z) + F_{w^0}(Z) - F_w(Z))(Y - F_{w^0}(Z) + F_{w^0}(Z) - F_w(Z))^T \right)$$
$$= E\left( (Y - F_{w^0}(Z))(Y - F_{w^0}(Z))^T \right) + E\left( (F_{w^0}(Z) - F_w(Z))(F_{w^0}(Z) - F_w(Z))^T \right)$$
$$= \Gamma_0 + E\left( (F_{w^0}(Z) - F_w(Z))(F_{w^0}(Z) - F_w(Z))^T \right),$$

where the cross terms vanish because the noise $Y - F_{w^0}(Z)$ is centred and independent of $Z$.
Then, the lemma is proved because the model is assumed to be identifiable (see equation
(2)), so

$$E\left( (F_{w^0}(Z) - F_w(Z))(F_{w^0}(Z) - F_w(Z))^T \right)$$

is a positive matrix, null only if $w = w^0$. ■
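The determinant identity and the inequality used in this proof can be checked numerically. The sketch below draws random positive semi-definite matrices as hypothetical stand-ins for the second-moment matrix of $F_{w^0}(Z) - F_w(Z)$; the matrix $\Gamma_0$ is the one used in the simulated example, and the sample sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)                     # arbitrary seed
gamma_0 = np.array([[1.81, 1.8], [1.8, 1.81]])
gaps = []
for _ in range(100):
    delta = rng.normal(size=(50, 2))               # stand-in for F_w0(Z) - F_w(Z)
    m = delta.T @ delta / 50                       # positive semi-definite matrix
    lhs = np.linalg.slogdet(gamma_0 + m)[1] - np.linalg.slogdet(gamma_0)[1]
    rhs = np.linalg.slogdet(np.linalg.inv(gamma_0) @ m + np.eye(2))[1]
    gaps.append((lhs, rhs))
gaps = np.array(gaps)                              # column 0: difference, column 1: identity
```

Both columns agree up to rounding and are strictly positive, in line with the fact that adding a non-null positive matrix to $\Gamma_0$ increases its determinant.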
From the assumptions (C), following example 19.8 of van der Vaart (1998), the set of
functions

$$w \mapsto (Y - F_w(Z))(Y - F_w(Z))^T, \quad w \in W$$

is Glivenko-Cantelli.

Now, by lemma 1, we remark that for every neighbourhood $O$ of $w^0$ there exists a number
$\eta(O) > 0$ such that for all $w \notin O$ we have

$$\log\det(\Gamma(w)) > \log\det(\Gamma_0) + \eta(O).$$

In order to show the strong consistency we have to prove that for every neighbourhood $O$
of $w^0$ we have $\lim_{n \to \infty} \hat{W}_n \overset{a.s.}{\in} O$, which is equivalent to

$$\lim_{n \to \infty} \log\det(\Gamma(\hat{W}_n)) - \log\det(\Gamma_0) \overset{a.s.}{<} \eta(O),$$

where $\Gamma(\hat{W}_n)$ is defined by equation (5). By definition, we have:

$$\log\det(\Gamma_n(\hat{W}_n)) \le \log\det(\Gamma_n(w^0)).$$

The Glivenko-Cantelli property and the continuity of the function $\Gamma \mapsto \log\det(\Gamma)$ imply that

$$\lim_{n \to \infty} \log\det(\Gamma_n(w^0)) - \log\det(\Gamma_0) \overset{a.s.}{=} 0,$$

therefore

$$\lim_{n \to \infty} \log\det(\Gamma_n(\hat{W}_n)) \overset{a.s.}{<} \log\det(\Gamma_0) + \frac{\eta(O)}{2}.$$

They also imply that

$$\lim_{n \to \infty} \log\det(\Gamma_n(\hat{W}_n)) - \lim_{n \to \infty} \log\det(\Gamma(\hat{W}_n)) \overset{a.s.}{=} 0$$

and finally

$$\lim_{n \to \infty} \log\det(\Gamma(\hat{W}_n)) - \frac{\eta(O)}{2} \overset{a.s.}{<} \lim_{n \to \infty} \log\det(\Gamma_n(\hat{W}_n)) \overset{a.s.}{<} \log\det(\Gamma_0) + \frac{\eta(O)}{2},$$

so

$$\lim_{n \to \infty} \log\det(\Gamma(\hat{W}_n)) \overset{a.s.}{<} \log\det(\Gamma_0) + \eta(O).$$
Proof of theorem 2. As usual, the asymptotic normality of the estimator minimising
$U_n(w)$ is a consequence of the Taylor expansion of $U_n(w)$ around the parameter $w^0$. So, the
computation of the first and second derivatives of $U_n(w)$ is necessary to get these results.

Let us introduce a notation: if $F_w(z)$ is a $d$-dimensional parametric function depending
on a parameter vector $w$, we write $\frac{\partial F_w(z)}{\partial w_k}$ (resp. $\frac{\partial^2 F_w(z)}{\partial w_k \partial w_l}$) for the $d$-dimensional vector of
the partial derivatives (resp. second order partial derivatives) of each component of $F_w(z)$.
Moreover, if $\Gamma(w)$ is a matrix depending on $w$, let us write $\frac{\partial}{\partial w_k}\Gamma(w)$ for the matrix of partial
derivatives of each component of $\Gamma(w)$.

First derivatives. Now, if $\Gamma_n(w)$ is a matrix depending on the parameter vector $w$,
we get from Magnus and Neudecker (1988)

$$\frac{\partial}{\partial w_k} \log\det(\Gamma_n(w)) = \operatorname{tr}\left( \Gamma_n^{-1}(w) \frac{\partial}{\partial w_k} \Gamma_n(w) \right)$$

with

$$\Gamma_n(w) = \frac{1}{n} \sum_{t=1}^{n} (y_t - F_w(z_t))(y_t - F_w(z_t))^T.$$

Note that this matrix $\Gamma_n(w)$ and its inverse are symmetric. Now, if we write

$$A_n(w_k) = \frac{1}{n} \sum_{t=1}^{n} \left( -\frac{\partial F_w(z_t)}{\partial w_k} (y_t - F_w(z_t))^T \right), \tag{6}$$

so that $\frac{\partial}{\partial w_k}\Gamma_n(w) = A_n(w_k) + A_n^T(w_k)$, using the fact that

$$\operatorname{tr}\left( \Gamma_n^{-1}(w) A_n(w_k) \right) = \operatorname{tr}\left( A_n^T(w_k) \Gamma_n^{-1}(w) \right) = \operatorname{tr}\left( \Gamma_n^{-1}(w) A_n^T(w_k) \right),$$

we get

$$\frac{\partial}{\partial w_k} \log\det(\Gamma_n(w)) = 2 \operatorname{tr}\left( \Gamma_n^{-1}(w) A_n(w_k) \right).$$

As we will see in an example, the calculation of this derivative is generally easy.

Example: calculation of the derivative for an MLP. The $i$th component of a
multidimensional function will be denoted $F_w(z_t)(i)$ and, for a matrix $A = (A_{ij})$, we write
$\operatorname{vec}(A)$ for the vector obtained by concatenation of the columns of $A$. Following the previous
results, and according to Magnus and Neudecker (1988), we can write the derivative of
$\log(\det(\Gamma_n(w)))$ with respect to the weight $w_k$:

$$\frac{\partial}{\partial w_k} \log\det(\Gamma_n(w)) = \operatorname{vec}\left( \Gamma_n^{-1}(w) \right)^T \operatorname{vec}\left( \frac{\partial}{\partial w_k} \Gamma_n(w) \right) \tag{7}$$

with $\frac{\partial}{\partial w_k}\Gamma_n(w)$ the matrix whose component $ij$ is:

$$-\frac{1}{n} \sum_{t=1}^{n} \left( \frac{\partial F_w(z_t)(i)}{\partial w_k} \left( y_t(j) - F_w(z_t)(j) \right) + \frac{\partial F_w(z_t)(j)}{\partial w_k} \left( y_t(i) - F_w(z_t)(i) \right) \right). \tag{8}$$

Back-propagation is the standard way to compute the derivatives with an MLP function
(see Haykin (1999)). Here, if we consider the MLP restricted to the output $i$, the quantity
$\frac{\partial F_w(z_t)(i)}{\partial w_k}$ can be computed by back-propagating the constant 1. For example, figure 1 gives
an example of an MLP restricted to the output 2.

Figure 1: MLP restricted to the output 2: the continuous lines

Hence, the computation of the gradient of $U_n(w)$ with respect to the parameters of the
MLP is straightforward. We have to compute the derivative with respect to the weights of
each single-output MLP extracted from the original MLP. This computation can be done
by back-propagating the constant value 1. Then, according to formula (8), we compute the
derivative of each term of the empirical covariance matrix of the noise. Finally, the gradient is
obtained by the sum of all the derivative terms of the empirical covariance matrix multiplied
by the terms of its inverse as in formula (7).
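The gradient formula $2\operatorname{tr}(\Gamma_n^{-1}(w) A_n(w_k))$ can be checked against finite differences. In the sketch below the Jacobian comes from a linear stand-in model F_w(z) = W z rather than from back-propagation through an MLP, purely to keep the example short; the data, dimensions, seed and function names are hypothetical.

```python
import numpy as np

def logdet_cost(res):
    # U_n: log det of the empirical covariance of the residuals, shape (n, d).
    return np.linalg.slogdet(res.T @ res / len(res))[1]

def logdet_gradient(res, jac):
    # jac[t, i, k] = dF_w(z_t)(i) / dw_k, shape (n, d, K).
    n = len(res)
    gamma_inv = np.linalg.inv(res.T @ res / n)
    # A_n(w_k) = -(1/n) sum_t dF_w(z_t)/dw_k (y_t - F_w(z_t))^T, stacked as (K, d, d).
    a_n = -np.einsum("tik,tj->kij", jac, res) / n
    # k-th entry: 2 tr(Gamma_n^{-1} A_n(w_k)).
    return 2.0 * np.einsum("ij,kji->k", gamma_inv, a_n)

# Stand-in model (assumption): linear map F_w(z) = W z with w = W.ravel().
rng = np.random.default_rng(0)
n, d, d_in = 200, 2, 3
z = rng.normal(size=(n, d_in))
y = z @ rng.normal(size=(d_in, d)) + rng.normal(size=(n, d))
w = rng.normal(size=d * d_in)

def model_residuals(w):
    return y - z @ w.reshape(d, d_in).T

jac = np.zeros((n, d, d * d_in))
for i in range(d):
    jac[:, i, i * d_in:(i + 1) * d_in] = z         # dF(i)/dW[i, j] = z_j

analytic = logdet_gradient(model_residuals(w), jac)
eps = 1e-6
numeric = np.array([
    (logdet_cost(model_residuals(w + eps * e)) - logdet_cost(model_residuals(w - eps * e))) / (2 * eps)
    for e in np.eye(d * d_in)
])
```

For an MLP only the array `jac` would change: its entries would be obtained by back-propagating the constant 1 through each single-output network, as described above.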

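Second derivatives. We now write

$$B_n(w_k, w_l) := \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\partial F_w(z_t)}{\partial w_k} \frac{\partial F_w(z_t)^T}{\partial w_l} \right)$$

and

$$C_n(w_k, w_l) := \frac{1}{n} \sum_{t=1}^{n} \left( -(y_t - F_w(z_t)) \frac{\partial^2 F_w(z_t)^T}{\partial w_k \partial w_l} \right).$$

We get

$$\frac{\partial^2 U_n(w)}{\partial w_k \partial w_l} = 2 \operatorname{tr}\left( \frac{\partial \Gamma_n^{-1}(w)}{\partial w_l} A_n(w_k) \right) + 2 \operatorname{tr}\left( \Gamma_n^{-1}(w) B_n(w_k, w_l) \right) + 2 \operatorname{tr}\left( \Gamma_n^{-1}(w) C_n(w_k, w_l) \right).$$

Now, Magnus and Neudecker (1988) give an analytic form of the derivative of an inverse
matrix:

$$\frac{\partial \Gamma_n^{-1}(w)}{\partial w_l} = -\Gamma_n^{-1}(w) \left( A_n(w_l) + A_n^T(w_l) \right) \Gamma_n^{-1}(w),$$

so

$$\frac{\partial^2 U_n(w)}{\partial w_k \partial w_l} = -2 \operatorname{tr}\left( \Gamma_n^{-1}(w) \left( A_n(w_l) + A_n^T(w_l) \right) \Gamma_n^{-1}(w) A_n(w_k) \right) + 2 \operatorname{tr}\left( \Gamma_n^{-1}(w) B_n(w_k, w_l) \right) + 2 \operatorname{tr}\left( \Gamma_n^{-1}(w) C_n(w_k, w_l) \right). \tag{9}$$

Now, theorem 2 will follow from this fundamental lemma.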



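Lemma 2. Let $\Delta U_n(w^0)$ be the gradient vector of $U_n(w)$ at $w^0$ and $HU_n(w^0)$ be the
Hessian matrix of $U_n(w)$ at $w^0$.
We finally define

$$B(w_k, w_l) := \frac{\partial F_w(Z)}{\partial w_k} \frac{\partial F_w(Z)^T}{\partial w_l}.$$

Then, under the assumption (AN) we get:

1. $\sqrt{n}\, \Delta U_n(w^0) \xrightarrow{\mathcal{L}} \mathcal{N}(0, 4 I_0)$

2. $HU_n(w^0) \xrightarrow{a.s.} 2 I_0$,

where the component $(k, l)$ of the matrix $I_0$ is:

$$\operatorname{tr}\left( \Gamma_0^{-1} E\left( B(w^0_k, w^0_l) \right) \right).$$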

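Proof of lemma 2: Let us begin with the first result. Under the condition (AN)-1, $A_n(w_k)$
(see equation (6)) is square integrable, so it verifies the central limit theorem. As the $k$th term
of $\Delta U_n(w^0)$ is equal to $2 \operatorname{tr}\left( \Gamma_n^{-1}(w^0) A_n(w^0_k) \right)$ and $\Gamma_n^{-1}(w^0) \xrightarrow{a.s.} \Gamma_0^{-1}$, by the Slutsky lemma (see
lemma 2.8 of van der Vaart (1998)), $\Delta U_n(w^0)$ verifies the central limit theorem too. Now,
let us write

$$A(w_k) := -\frac{\partial F_w(Z)}{\partial w_k} (Y - F_w(Z))^T$$

and write $\frac{\partial U(w^0)}{\partial w_k} := 2 \operatorname{tr}\left( \Gamma_0^{-1} A(w^0_k) \right)$. We first remark that the component $(k, l)$ of the matrix $4 I_0$
is:

$$E\left( \frac{\partial U(w^0)}{\partial w_k} \frac{\partial U(w^0)}{\partial w_l} \right) = E\left( 2 \operatorname{tr}\left( \Gamma_0^{-1} A^T(w^0_k) \right) \times 2 \operatorname{tr}\left( \Gamma_0^{-1} A(w^0_l) \right) \right)$$

and, since the trace of the product is invariant by circular permutation,

$$E\left( \frac{\partial U(w^0)}{\partial w_k} \frac{\partial U(w^0)}{\partial w_l} \right) = 4 E\left( \frac{\partial F_{w^0}(Z)^T}{\partial w_k} \Gamma_0^{-1} (Y - F_{w^0}(Z))(Y - F_{w^0}(Z))^T \Gamma_0^{-1} \frac{\partial F_{w^0}(Z)}{\partial w_l} \right)$$
$$= 4 E\left( \frac{\partial F_{w^0}(Z)^T}{\partial w_k} \Gamma_0^{-1} \frac{\partial F_{w^0}(Z)}{\partial w_l} \right)$$
$$= 4 \operatorname{tr}\left( \Gamma_0^{-1} E\left( B(w^0_k, w^0_l) \right) \right).$$

This proves the first result.

Let us now prove the second result. For the component $(k, l)$ of the expectation of the
Hessian matrix, we remark that

$$\lim_{n \to \infty} A_n(w^0_k) = E\left( A(w^0_k) \right) = 0$$

because the noise $\varepsilon = Y - F_{w^0}(Z)$ is centred and independent of the random variable $Z$.
Hence

$$\lim_{n \to \infty} \operatorname{tr}\left( \Gamma_n^{-1}(w^0) \left( A_n(w^0_l) + A_n^T(w^0_l) \right) \Gamma_n^{-1}(w^0) A_n(w^0_k) \right) = 0$$

and, for the same reason,

$$\lim_{n \to \infty} \operatorname{tr}\left( \Gamma_n^{-1}(w^0) C_n(w^0_k, w^0_l) \right) = 0.$$

Finally, for the component $(k, l)$ of the Hessian matrix,

$$\lim_{n \to \infty} HU_n(w^0) = \lim_{n \to \infty} \Big[ -2 \operatorname{tr}\left( \Gamma_n^{-1}(w^0) \left( A_n(w^0_l) + A_n^T(w^0_l) \right) \Gamma_n^{-1}(w^0) A_n(w^0_k) \right)$$
$$\qquad + 2 \operatorname{tr}\left( \Gamma_n^{-1}(w^0) B_n(w^0_k, w^0_l) \right) + 2 \operatorname{tr}\left( \Gamma_n^{-1}(w^0) C_n(w^0_k, w^0_l) \right) \Big]$$
$$= 2 \operatorname{tr}\left( \Gamma_0^{-1} E\left( B(w^0_k, w^0_l) \right) \right).$$
■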

As the matrix $I_0$ is assumed to be invertible, following the same argument of local asymptotic
normality as in Yao (2000), we get the Taylor formula with an integral remainder:

$$\Delta U_n(\hat{W}_n) = \Delta U_n(w^0) + \left( \int_0^1 HU_n\left( w^0 + u(\hat{W}_n - w^0) \right) du \right) (\hat{W}_n - w^0).$$

The condition (AN)-2 implies that

$$\left\| \frac{\partial F_{w_1}(z)}{\partial w_k} - \frac{\partial F_{w_2}(z)}{\partial w_k} \right\| \le \|w_1 - w_2\| \, m_2(z)$$

and

$$\left\| \frac{\partial^2 F_{w_1}(z)}{\partial w_k \partial w_l} - \frac{\partial^2 F_{w_2}(z)}{\partial w_k \partial w_l} \right\| \le \|w_1 - w_2\| \, m_3(z).$$

So, there exists an integrable function $g((Y_1, Z_1), \dots, (Y_n, Z_n))$ such that, for all $w_1$ and $w_2$

It follows from this inequality that

Finally, Theorem 2 is an obvious consequence of this last equation and lemma 2. ■